Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle idempotent retry after error during initializing phase #1406

Merged
merged 5 commits into from
Sep 28, 2023

Conversation

peterbroadhurst
Copy link
Contributor

@peterbroadhurst peterbroadhurst commented Sep 24, 2023

Currently if an error occurs from a blockchain connector (EVMConnect for example) on submission of a transaction, then the operation goes immediately into Failed status - even if an idempotencyKey was supplied.

This means that a subsequent attempt to re-submit the transaction fails with:
FF10431: Idempotency key '...' already used for transaction '...'

This is not correct per the architecture described in https://hyperledger.github.io/firefly/head/reference/idempotency.html

Because the error happened during the submission of the idempotent transaction identifier into its FireFly Transaction Manager (FFTM) database, it is safe for a future retry to attempt to insert it again.

At that later point, the connector can return a 409 if it's already received that transaction, or it can successfully accept it into it's DB and process it.

This PR makes the following changes to move us towards the architecture defined behavior in this scenario:

  • Where RunOperation used to return complete as a bool, a new OpPhase has been introduced:
    • OpPhaseComplete - good or bad, the operation is complete
    • OpPhasePending - the operation has been accepted, and it's completion will happen asynchronously
    • OpPhaseInitializing - the operation has not transitioned out of initializing yet
  • Wherever an operation is executed, a new idempotentSubmit parameter is added
    • Set to true for all transactions where idempotencyKey != ""
    • When this is true a transaction will stay in Initializing (not move to Failed) if:
      • An error occurs
      • The phase returned from RunOperation is OpPhaseInitializing
    • Must be supplied externally to operation manager, as it does not have access to the TX object

These changes thus ensure that the Initializing status is held on to longer than is currently true.

Before: Initializing was only the status until the first attempt to submit the transaction was initiated
After: Initializing remains the status until the connector receives the idempotent submission of the TX

…pPhase enum

Signed-off-by: Peter Broadhurst <peter.broadhurst@kaleido.io>
@codecov-commenter
Copy link

codecov-commenter commented Sep 24, 2023

Codecov Report

Merging #1406 (3f2a21f) into main (a2ebfa7) will increase coverage by 0.00%.
Report is 3 commits behind head on main.
The diff coverage is 100.00%.

❗ Current head 3f2a21f differs from pull request most recent head 9e32841. Consider uploading reports for the commit 9e32841 to get more accurate results

@@           Coverage Diff           @@
##             main    #1406   +/-   ##
=======================================
  Coverage   99.98%   99.99%           
=======================================
  Files         321      321           
  Lines       22960    23026   +66     
=======================================
+ Hits        22956    23024   +68     
+ Misses          2        1    -1     
+ Partials        2        1    -1     
Files Coverage Δ
internal/assets/manager.go 100.00% <ø> (ø)
internal/assets/operations.go 100.00% <100.00%> (ø)
internal/assets/token_approval.go 100.00% <100.00%> (ø)
internal/assets/token_pool.go 100.00% <100.00%> (ø)
internal/assets/token_transfer.go 100.00% <100.00%> (ø)
internal/broadcast/manager.go 100.00% <100.00%> (ø)
internal/broadcast/operations.go 100.00% <100.00%> (ø)
internal/contracts/manager.go 100.00% <100.00%> (ø)
internal/contracts/operations.go 100.00% <100.00%> (ø)
internal/events/batch_pin_complete.go 100.00% <100.00%> (ø)
... and 12 more

... and 1 file with indirect coverage changes

@peterbroadhurst
Copy link
Contributor Author

Validated with raw blockchain transactions:

  • Started core
  • Stopped EVMConnect
  • Submitted transaction with empty idempotencyKey
    • Resulting operation status was Failed
  • Submitted transaction with idempotencyKey set to a value
    • Resulting operation status was Initialized
  • Restarted EVMConnect and waited for FF to reconnect its WS
  • Resubmitted transaction with imdempotencyKey set to failed failu
    • Resulting operation status was Succeeded
  • Confirmed FF10431 received if I tried to re-use idempotency key

_, err = am.operations.RunOperation(ctx, opActivatePool(op, pool))
_, err = am.operations.RunOperation(ctx, opActivatePool(op, pool),
false, // TODO: this operation should be made idempotent, but cannot inherit this from the TX per our normal semantics
// as the transaction is only on the submitting side and this is triggered on all parties.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this saying we insert the operation but not the transaction on the non-submitter members? That seems odd/incorrect.

FWIW activation is an idempotent operation at the connector level (so shouldn't be problematic if we activate multiple times) - not sure if that changes how we want to handle this path.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was my reading of the code - would appreciate if you can confirm for me.

It was based on reading this:

// Message will remain unconfirmed, but plugin will be notified to activate the pool
// This will ultimately trigger a pool creation event and a rewind
state.AddPreFinalize(func(ctx context.Context) error {
if err := dh.assets.ActivateTokenPool(ctx, pool); err != nil {
log.L(ctx).Errorf("Failed to activate token pool '%s': %s", pool.ID, err)
return err
}
return nil
})

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like we do eventually persist the token pool TX, after the activation completes:

return em.confirmPool(ctx, existingPool, pool.Event)

So in the happy path, you do end up with a TX on all members, and the activation operations end up showing underneath it.

But it feels like we should probably persist the TX earlier, before we start creating operations.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is that something you'd like to raise a separate issue for @awrichar ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

…r creating the TX

Signed-off-by: Peter Broadhurst <peter.broadhurst@kaleido.io>
@peterbroadhurst
Copy link
Contributor Author

I extended the fix after testing, as I found in the tokens scenarios if I attempt to recreate the error by having the resolve endpoint down for the signing address, we wrote zero operations to the database.

Then when we came back round ResubmitOperations said it resubmitted zero - correctly - but that's not the information we needed and we returned an idempotency error.

We actually needed to know _resubmitted zero, because no operations exist (vs. all operations reached Pending which was the previous assumption).

So I added a third total return parameter to ResubmitOperations, and in the case it is zero we now re-do the transactions as if we'd generated a new TX (but re-using the TX ID)

@peterbroadhurst
Copy link
Contributor Author

Completed E2E testing (noting it's important to also have hyperledger/firefly-tokens-erc20-erc721#137 applied).

@peterbroadhurst peterbroadhurst marked this pull request as ready for review September 27, 2023 21:00
@peterbroadhurst peterbroadhurst requested a review from a team as a code owner September 27, 2023 21:00
}
switch data.Transfer.Type {
case core.TokenTransferTypeMint:
return nil, false, plugin.MintTokens(ctx, op.NamespacedIDString(), data.Pool.Locator, data.Transfer, data.Pool.Methods)
return nil, core.OpPhaseInitializing, plugin.MintTokens(ctx, op.NamespacedIDString(), data.Pool.Locator, data.Transfer, data.Pool.Methods)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are these transfer operations all considered "initializing"? If we get a successful return from the plugin, doesn't that mean the transaction has made it all the way through the token connector to the blockchain connector?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's the phase we're in as a result of the error in this layer of the code, rather than the operation state.

e.g. this error occurred before we submitted the operation to the connector, so the calling code can decide if that means we don't store the operation at all, or we store it in state Initializing

Copy link
Contributor Author

@peterbroadhurst peterbroadhurst Sep 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have not analyzed very scenario to know if the error path is such, that it could never succeed.
If I had done that (quite large) piece of work everywhere, I could have introduced a phase of OpPhaseRejected to mean - "this is never going to work, tell them it's broken... without even storing the operation at all".

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK - so the phase is more meaningful in an error return case in order to determine how to respond to the error (not so much in a success case)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I noticed in some managers you used operations.ErrTernary, but not here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry - yes, this should be operations.ErrTernary almost certainly - I'd thought this was an error return.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok - added a commit for that

approval: approval,
mgr: am,
approval: approval,
idempotentSubmit: approval.IdempotencyKey != "",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure we really need to store this separate bool here, as we can always compute it from the approval above at the point we need it (but I'm fine either way)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would have to do more investigation to convince myself of that.. and if I did so, I'd probably want to rename approval to approvalInput (rather than the implying it's the actual approval object).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So... if it's ok, think I'd like to leave it as is for this PR

Copy link
Contributor

@awrichar awrichar Sep 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But yes, the thing we store is the input DTO, so possible it's named incorrectly.

if resubmitErr != nil {
// Error doing resubmit, return the new error
return false, resubmitErr
}
if len(resubmitted) > 0 {
if total == 0 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This logic feels slightly confusing to me. I guess what we're trying to differentiate is these 3 cases:

  • no operations have ever been started for this transaction
  • some operations were previously initialized for this transaction, but we think they weren't fully submitted, so we've now retried them
  • some operations were previously initialized for this transaction, but we think they're all currently in flight, so we can't do anything but continue waiting

We're expressing these 3 states with a combination of two integer returns... wonder if there's a cleaner way to express it?

Copy link
Contributor Author

@peterbroadhurst peterbroadhurst Sep 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes - agreed.

I think pushing all this into a combined operation+transaction manager as a function that has plug points for the differentiation, rather than the whole of this section being boilerplate copied in lots of places would be good.

Choice is whether this little bit of extension to the boilerplate is the point to take on the bigger refactor.

I could merge the SubmitNewTransaction and ResubmitOperations functions to a new function, contains this logic in an opinionated way and take in the idempotencyKey out of the input-request, and return:

  1. It's all done now - please return without error to the user
  2. Nothing was done, here's the existing transaction ID
  3. We've never heard of this idempotency key, here's a shiny new TX
  4. This is already submitted completely, 409 to your user
  5. Something else went wrong

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, point taken - it may not be worth continuing to iterate on this interface until we have the cycles to do a larger refactor. We should capture that next step as a task though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Raised #1410

Signed-off-by: Peter Broadhurst <peter.broadhurst@kaleido.io>
Copy link
Contributor

@awrichar awrichar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. I think all remaining feedback has been captured in new issues.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants